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BAYES AND EMPIRICAL-BAYES MULTIPLICITY ADJUSTMENT 
IN THE VARIABLE-SELECTION PROBLEM 

By James G. Scott 1 and James O. Berger 

University of Texas at Austin and Duke University 

This paper studies the multiplicity-correction effect of standard 
Bayesian variable-selection priors in linear regression. Our first goal 
is to clarify when, and how, multiplicity correction happens auto- 
matically in Bayesian analysis, and to distinguish this correction 
from the Bayesian Ockham's-razor effect. Our second goal is to con- 
trast empirical-Bayes and fully Bayesian approaches to variable selec- 
tion through examples, theoretical results and simulations. Consider- 
able differences between the two approaches are found. In particular, 
we prove a theorem that characterizes a surprising aymptotic dis- 
crepancy between fully Bayes and empirical Bayes. This discrepancy 
arises from a different source than the failure to account for hyper- 
parameter uncertainty in the empirical-Bayes estimate. Indeed, even 
at the extreme, when the empirical-Bayes estimate converges asymp- 
totically to the true variable-inclusion probability, the potential for a 
serious difference remains. 

1. Introduction. This paper addresses concerns about multiplicity in 
the traditional variable-selection problem for linear models. We focus on 
Bayesian and empirical-Bayesian approaches to the problem. These methods 
both have the attractive feature that they can, if set up correctly, account 
for multiplicity automatically, without the need for ad-hoc penalties. 

Given the huge number of possible predictors in many of today's scien- 
tific problems, these concerns about multiplicity are becoming ever more 
relevant. They are especially critical when researchers have little reason to 
suspect one model over another, and simply want the data to flag interest- 
ing covariates from a large pool. In such cases, variable selection is treated 
less as a formal inferential framework and more as an exploratory tool used 
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to generate insights about complex, high-dimensional systems. Still, the re- 
sults of such studies are often used to buttress scientific conclusions or guide 
policy decisions — conclusions or decisions that may be quite wrong if the 
implicit multiple-testing problem is ignored. 

Our first objective is to clarify how multiplicity correction enters Bayesian 
variable selection: by allowing the choice of prior model probabilities to de- 
pend upon the data in an appropriate way. Some useful references on this 
idea include Waller and Duncan (1969), Meng and Dempster (1987), Berry 
(1988), Westfall, Johnson and Utts (1997), Berry and Hochberg (1999) and 
Scott and Berger (2006). We also clarify the difference between multiplicity 
correction and the Bayesian Ockham's-razor effect [see Jefferys and Berger 
(1992)], which induces a very different type of penalty on model complex- 
ity. This discussion will highlight the fact that not all Bayesian analyses 
automatically adjust for multiplicity. 

Our second objective is to describe and investigate a peculiar discrepancy 
between fully Bayes and empirical-Bayes variable selection. This discrepancy 
seems to arise from a different source than the failure to account for uncer- 
tainty in the empirical-Bayes estimate — the usual issue in such problems. 
Indeed, even when the empirical-Bayes estimate converges asymptotically 
to the true hyperparameter value, the potential for a serious difference re- 
mains. 

The existence of such a discrepancy between fully Bayesian answers and 
empirical-Bayes answers — especially one that persists even in the limit — 
is of immediate interest to Bayesians, who often use empirical Bayes as a 
computational simplification. But the discrepancy is also of interest to non- 
Bayesians for at least two reasons. 

First, frequentist complete-class theorems suggest that if an empirical- 
Bayes analysis does not approximate some fully Bayesian analysis, then it 
may be suboptimal and needs alternative justification. Such justifications 
can be found for a variety of situations in George and Foster (2000), Efron 
et al. (2001), Johnstone and Silverman (2004), Bogdan, Ghosh and Zak- 
Szatkowska (2008), Cui and George (2008), Bogdan, Chakrabarti and Ghosh 
(2008) and Bogdan, Ghosh and Tokdar (2008). 

Second, theoretical and numerical investigations of the discrepancy re- 
vealed some unsettling properties of the standard empirical-Bayes analysis 
in variable selection. Of most concern is that empirical Bayes has the po- 
tential to collapse to a degenerate solution, resulting in an inappropriate 
statement of certainty in the selected regression model. As a simple exam- 
ple, suppose the usual variable-selection prior is used, where each variable is 
presumed to be in the model independently with an unknown common prob- 
ability p. A common empirical-Bayes method is to estimate p by marginal 
maximum likelihood (or Type-II maximum likelihood, as it is commonly 
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called; see Section 3.2). This estimated p is then used to determine the pos- 
terior probabilities of models. This procedure will be shown to have the 
startlingly inappropriate property of assigning final probability 1 to either 
the full model or the intercept-only (null) model whenever the full (or null) 
model has the largest marginal likelihood, even if this marginal likelihood is 
only slightly larger than that of the next-best model. 

This is certainly not the first situation in which the Type-II MLE ap- 
proach to empirical Bayes has been shown to have problems. But the un- 
usual character of the problem in variable selection seems not to have been 
recognized. 

In bringing this issue to light, our goal is not to criticize empirical- 
Bayes analysis per se. Indeed, this paper will highlight many virtues of the 
empirical-Bayes approach to variable selection, especially compared to the 
nonadaptive model prior probabilities that are often used for variable selec- 
tion. Our primary goal is comparative, rather than evaluative, in nature. In 
particular, we wish to explore the implications of the above discrepancy for 
Bayesians, who are likely to view empirical Bayes as an approximation to 
full Bayes analysis, and who wish to understand when the approximation 
is a good one. We recognize that others have alternative goals for empirical 
Bayes, and that these goals do not involve approximating full Bayes analysis. 
Also, there are non-Bayesian alternatives to marginal maximum likelihood 
in estimating p, as shown in some of the above papers. The results in this 
paper suggest that such alternatives be seriously considered by those wishing 
to adopt the empirical-Bayes approach, especially in potentially degenerate 
situations. 

Section 2 introduces notation. Section 3 gives a brief historical and method- 
ological overview of multiplicity correction for Bayesian variable selection, 
and focuses on the issue of clarifying the source and nature of the correction. 
Sections 4 and 5 introduce a theoretical framework for characterizing the 
differences between fully Bayesian and empirical-Bayes analyses, and gives 
several examples and theoretical results concerning the differences. Section 6 
presents numerical results indicating the practical nature of the differences, 
through a simulation experiment and a practical example. Section 7 gives 
further discussion of the results. 

2. Preliminaries. 

2.1. Notation. Consider the usual problem of variable selection in linear 
regression. Given a vector Y of n responses and an n x m design matrix X, 
the goal is to select k predictors out of m possible ones for fitting a model 
of the form 



(1) 



Yi = a + X ih P h + ■■■+ X ijk f3 jk + Ei 
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for some {ji, . . . ,jk} C {1, . . . , m}, where e, *~ ' N(0, i^" 1 ) for an unknown 
variance 

All models are assumed to include an intercept term a. Let Mq denote the 
null model with only this intercept term, and let Mp denote the full model 
with all covariates under consideration. The full model thus has parame- 
ter vector 6' = (a,/3 ), = (Pi, ■ ■ ■ ,P m )'- Submodels M 7 are indexed by a 
binary vector -f of length m indicating a set of fc 7 < m nonzero regression 
coefficients /3_,: 

JO, if ft = 0, 
ll \1, ifft^O. 

It is most convenient to represent model uncertainty as uncertainty in 
-f, a random variable that takes values in the discrete space {0, l}" 1 , which 
has 2 m members. Inference relies upon the prior probability of each model, 
p(M-y), along with the marginal likelihood of the data under each model: 

(2) f(Y | M 7 ) = J f(Y | 7 , ^)tt(0 7 , 0) dO^ d<f>, 

where ir(0j,(j)) is the prior for model-specific parameters. These together 
define, up to a constant, the posterior probability of a model: 

(3) p(M 7 | Y)cxp(M T )/(Y | M 7 ). 

Let X 7 denote the columns of the full design matrix X given by the 
nonzero elements of 7, and let X 7 denote the concatenation (1 X 7 ), where 
1 is a column of ones corresponding to the intercept a. For simplicity, we 
will assume that all covariates have been centered so that 1 and X 7 are 
orthogonal. We will also assume that the common choice n(a) = 1 is made 
for the parameter a in each model [see Berger, Pericchi and Varshavsky 
(1998) for a justification of this choice of prior]. 

Often all models will have small posterior probability, in which case more 
useful summaries of the posterior distribution are quantities such as the 
posterior inclusion probabilities of the individual variables: 

(4) pi = Pr( 74 ^ I Y) = hi=i ■ P( M f I Y )- 

7 

These quantities also define the median-probability model, which is the 
model that includes those covariates having posterior inclusion probability 
at least 1/2. Under many circumstances, this model has greater predictive 
power than the most probable model [Barbieri and Berger (2004)]. 
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2.2. Priors for model- specific parameters. There is an extensive body of 
literature confronting the difficulties of Bayesian model choice in the face 
of weak prior information. These difficulties arise due to the obvious de- 
pendence of the marginal likelihoods in (2) upon the choice of priors for 
model-specific parameters. In general, one cannot use improper priors on 
these parameters, since this leaves the resulting Bayes factors defined only 
up to an arbitrary multiplicative constant. 

This paper chiefly uses null-based ^-priors [Zellner (1986)] for computing 
the marginal likelihoods in (2); explicit expressions can be found in the 
Appendix. See Liang et al. (2008) for a recent discussion of g-priors, and 
mixtures thereof, for variable selection. 

3. Approaches to multiple testing. 

3.1. Bayes factors, Ockham's razor and multiplicity. In both Bayes and 
empirical-Bayes variable selection, the marginal likelihood contains a built-in 
penalty for model complexity that is often called the Bayesian "Ockham's- 
razor effect" [Jefferys and Berger (1992)]. This penalty arises in integrating 
the likelihood across a higher-dimensional parameter space under the more 
complex model, resulting in a more diffuse predictive distribution for the 
data. 

While this is a penalty against more complex models, it is not a multiple- 
testing penalty. Observe that the Bayes factor between two fixed models will 
not change as more possible variables are thrown into the mix, and hence 
will not exert control over the number of false positives as m grows large. 

Instead, multiplicity must be handled through the choice of prior proba- 
bilities of models. The earliest recognition of this idea seems to be that of 
Jeffreys in 1939, who gave a variety of suggestions for apportioning proba- 
bility across different kinds of model spaces [see Sections 1.6, 5.0 and 6.0 of 
Jeffreys (1961), a later edition]. Jeffreys paid close attention to multiplicity 
adjustment, which he called "correcting for selection." In scenarios involving 
an infinite sequence of nested models, for example, he recommended using 
model probabilities that formed a convergent geometric series, so that the 
prior odds ratio for each pair of neighboring models (i.e., those differing by 
a single parameter) was a fixed constant. Another suggestion, appropriate 
for more general contexts, was to give all models of size k a single lump of 
probability to be apportioned equally among models of that size. Below, in 
fact, the fully Bayesian solution to multiplicity correction will be shown to 
have exactly this flavor. 

It is interesting that, in the variable-selection problem, assigning all mod- 
els equal prior probability (which is equivalent to assigning each variable 
prior probability of 1/2 of being in the model) provides no multiplicity con- 
trol. This is most obvious in the orthogonal situation, which can be viewed 
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as m independent tests of Hi : = 0. If each of these tests has prior proba- 
bility of 1/2, there will be no multiplicity control as m grows. Indeed, note 
that this "pseudo-objective" prior reflects an a priori expected model size 
of m/2 with a standard deviation of y/m/2, meaning that the prior for the 
fraction of included covariates becomes very tight around 1/2 as m grows. 
See Bogdan, Ghosh and Tokdar (2008) for extensive discussion of this issue. 

3.2. Variable- selection priors and empirical Bayes. The standard mod- 
ern practice in Bayesian variable-selection problems is to treat variable in- 
clusions as exchangeable Bernoulli trials with common success probability 
p, which implies that the prior probability of a model is given by 

(5) p{M 1 \p)=p k -<{l-p) rn - k i 

with kj representing the number of included variables in the model. 

We saw above that selecting p = 1/2 does not provide multiplicity correc- 
tion. Treating p as an unknown parameter to be estimated from the data 
will, however, yield an automatic multiple-testing penalty. The intuition is 
that, as m grows with the true k remaining fixed, the posterior distribution 
of p will concentrate near 0, so that the situation is the same as if one had 
started with a very low prior probability that a variable should be in the 
model [Scott and Berger (2006)]. Note that one could adjust for multiplicity 
subjectively, by specifying p to reflect subjective belief in the proportion of 
variables that should be included. No fixed choice of p that is independent 
of m, however, can adjust for multiplicity. 

The empirical-Bayes approach to variable selection was popularized by 
George and Foster (2000), and is a common strategy for treating the prior 
inclusion probability p in (5) in a data-dependent way. The most common 
approach is to estimate the prior inclusion probability by maximum likeli- 
hood, maximizing the marginal likelihood of p summed over model space 
(often called Type-II maximum likelihood): 

(6) p = argmax V^p(M 7 | p) ■ f(Y | M 7 ). 

pe[o,i] 7 

One uses this in (5) to define the ex-post prior probabilities p(M 7 | p) = 
p k ~/(l _ j3) m - fc T 5 resulting in final model posterior probabilities 

(7) p(M 7 | Y) oc p k ~< ■ {l-p) m - k -f(Y | M 7 ). 

The EB solution p can be found either by direct numerical optimization or 
by the EM algorithm detailed in Liang et al. (2008). For an overview of 
empirical-Bayes methodology, see Carlin and Louis (2000). 

It is clear that the empirical-Bayes approach will control for multiplicity 
in a straightforward way: if there are only k true variables and m grows large, 
then p — > 0. This will make it increasingly more difficult for all variables to 
overcome the ever-stronger prior bias against their relevance. 
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Fig. 1. Prior model probability versus model size. 



3.3. A fully Bayesian version. Fully Bayesian variable-selection priors 
have been discussed by Ley and Steel (2009), Cui and George (2008) and 
Carvalho and Scott (2009), among others. These priors assume that p has a 
Beta distribution, p~Be(a,6), giving 

, . f 1 , , . , . , B(a + k~, b + m — k~) 
(8) p(M 7 ) = / p(M 7 | p)ir(p) dp = ^- 



o 



0(a,b) 

where /?(-,•) is the beta function. For the default choice of a = b = 1, implying 
a uniform prior on p, this reduces to 

l9j PW" (m + i) (m! ) m + i^fc 



We call these expressions deriving from the uniform prior on p the "fully 
Bayes" version of variable selection priors, though of course many other 
priors could be used (including those incorporating subject-area informa- 
tion). Utilizing these prior probabilities in (3) yields the following posterior 
probabilities: 

(10) KM,|Y)oc^(™) _1 /(Y|M 7 ). 

This has the air of paradox: in contrast to (7), where the multiplicity ad- 
justment is apparent, here p has been marginalized away. How can p then 
be adjusted by the data so as to induce a multiplicity-correction effect? 

Figures 1 and 2 hint at the answer, which is that the multiplicity penalty 
was always in the prior probabilities in (9) to begin with; it was just hidden. 
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In Figure 1 the prior log-probability is plotted as a function of model size 
for a particular value of m (in this case 30). This highlights the marginal 
penalty that one must pay for adding an extra variable: in moving from 
the null model to a model with one variable, the fully Bayesian prior favors 
the simpler model by a factor of 30 (label A). This penalty is not uniform: 
models of size 9, for example, are favored to those of size 10 by a factor of 
only 2.1 (label B). 

Figure 2 then shows these penalties getting steeper as one considers more 
models. Adding the first variable incurs a 30-to-l prior-odds penalty if one 
tests 30 variables (label A as before), but a 60-to-l penalty if one tests 60 
variables. Similarly, the lOth-variable marginal penalty is about two-to-one 
for 30 variables considered (label B), but would be about four-to-one for 60 
variables. 

We were careful above to distinguish this effect from the Ockham's-razor 
penalty coming from the marginal likelihoods. But marginal likelihoods are 
clearly relevant. They determine where models will sit along the curve in 
Figure 1, and thus will determine whether the prior-odds multiplicity penalty 
for adding another variable to a good model will be more like 2, more like 
30 or something else entirely. Indeed, note that, if only large models have 
significant marginal likelihoods, then the "multiplicity penalty" will now 
become a "multiplicity advantage," as one is on the increasing part of the 
curve in Figure 1. (This is also consistent with the empirical-Bayes answer: 
if p > 0.5, then the analysis will increase the chance of variables entering the 
model.) 

Interestingly, the uniform prior on p also gives every variable a marginal 
prior inclusion probability of 1/2; these marginal probabilities are the same 
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Table 1 

Posterior inclusion probabilities (x 100) for the 10 real variables in the simulated data, 
along with the number of false positives (posterior inclusion probability greater than 1/2) 
from the "pure noise" columns in the design matrix. Marginal likelihoods were calculated 

(under Zellner-Siow priors) by enumerating the model space in the m — 11 and m — 20 
cases, and by 5 million iterations of the feature-inclusion stochastic-search algorithm 
[Berger and Molina (2005), Scott and Carvalho (2008)] in the m = 50 and m = 100 cases 
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as those induced by the "pseudo-objective" choice of p = 1/2. Yet because 
probability is apportioned among models in a very different way, profoundly 
different behaviors emerge. 

For example, Table 1 compares these two regimes on a simulated data 
set for which the true value of k was fixed at 10. The goal of the study 
is, in essence, to understand how posterior probabilities adapt to situations 
of increasingly egregious "data dredging," where a set of true covariates is 
tested in the presence of an ever-larger group of spurious covariates. We 
used a simulated m = 100 design matrix of N(0, 1) covariates and 10 regres- 
sion coefficients that differed from zero, along with 90 coefficients that were 
identically zero. The table summarizes the posterior inclusion probabilities 
of the 10 real variables as we test them along with an increasing number of 
noise variables (first 1, then 10, 40 and 90). It also indicates how many false 
positives (defined as having posterior inclusion probability > 0.5) are found 
among the noise variables. Here, "uncorrected" refers to giving all models 
equal prior probability by setting p = 1/2. "Oracle Bayes" is the result from 
choosing p to reflect the known fraction of nonzero covariates. 

The following points can be observed: 

• The fully Bayes and empirical Bayes procedures both exhibit clear multi- 
plicity adjustment: as the number of noise variables increases, the poste- 
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Table 2 

Posterior inclusion probabilities for the important main effects, 
quadratic effects and cross-product effects for ozone-concentration 
data under g-priors. Key: p — 1/2 implies that all models have equal 
prior probability; FB is fully Bayes; EB is empirical Bayes 
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0.38 
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0.34 


0.36 


0.29 
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0.74 


0.77 
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0.20 


0.03 


0.05 


xlO 


0.96 


0.96 


0.97 


xl-xl 


1.00 


0.97 


0.99 


x9-x9 


0.95 


0.82 


0.91 


xl-x2 


0.48 


0.16 


0.24 


x4-x7 


0.33 


0.10 


0.15 


x6-x8 


0.43 


0.25 


0.34 


x7-x8 


0.31 


0.13 


0.18 


x7-xl0 


0.71 


0.86 


0.85 



rior inclusion probabilities of variables decrease. The uncorrected Bayesian 
analysis shows no such adjustment and can, rather bizarrely, sometimes 
have the posterior inclusion probabilities increase as noise variables are 
added. 

• On the simulated data, proper multiplicity adjustment yields reasonably 
strong control over false positives, in the sense that the number of false 
positives appears bounded (and small) as m increases. In contrast, the 
number of false positives appears to be increasing linearly for the uncor- 
rected Bayesian analysis, as would be expected. 

• The full Bayes, empirical Bayes and oracle Bayes answers are all qual- 
itatively (though not quantitatively) similar; indeed, if one adopted the 
(median probability model) prescription of selecting those variables with 
posterior inclusion probability greater than 1/2, they would both always 
select the same variables, except in two instances. 

The differences between corrected and uncorrected analyses are quite 
stark, and calls into question the use of nonadaptive priors in situations 
with large numbers of potentially spurious covariates. For example, Table 2 
shows the posterior inclusion probabilities for a model of ozone concentra- 
tion levels outside Los Angeles that includes 10 atmospheric variables along 
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with all squared terms and second-order interactions (m = 65). Probabili- 
ties are given for uncorrected (p = 1/2), empirical Bayes and fully Bayesian 
analyses. All variables appear uniformly less impressive when adjusted for 
multiplicity. 

Other examples of such multiplicity correction put into practice can be 
found throughout the literature. For nonparametric problems, see Gopalan 
and Berry (1998); for gene-expression studies, see Do, Muller and Tang 
(2005); for econometrics, see Ley and Steel (2009); for Gaussian graphical 
models, see Carvalho and Scott (2009); and for time-series data, see Scott 
(2009). 

4. Theoretical comparison of Bayes and empirical Bayes. 

4.1. Motivation. The previous section showed some examples where fully 
Bayes and empirical-Bayes methods gave qualitatively similar results. While 
this rough correspondence between the two approaches does seem to hold in 
a wide variety of applied problems, we now turn attention to the question 
of when, and how, it fails. 

We begin with a surprising lemma that indicates the need for caution 
with empirical-Bayes methods in variable selection. The lemma refers to 
the variable-selection problem, with the prior variable inclusion probabil- 
ity p being estimated by marginal (or Type-II) maximum likelihood in the 
empirical-Bayes approach. 

Lemma 4.1. In the variable-selection problem, if Mq has the (strictly) 
largest marginal likelihood, then the Type-II MLE estimate of p is p = 0. 
Similarly, if Mp has the (strictly) largest marginal likelihood, then p = 1. 

Proof. Since p(M^) sums to 1 over 7, the marginal likelihood of p 
satisfies 

(11) /(Y) = ]T/(Y I M>(M 7 ) < max/(Y | M 7 ). 

r 7 

Furthermore, the inequality is strict under the conditions of the lemma (be- 
cause the designated marginals are strictly largest), unless the prior as- 
signs p(Mj) = 1 to the maximizing marginal likelihood. The only way that 
p{M-y) = p k ~< • (1 — p) m_fc T can equal 1 is for p to be or 1 and for the 
model to be Mq or Mp, respectively. At these values of p, equality is indeed 
achieved in (11) under the stated conditions, and the results follow. □ 

As a consequence, the empirical-Bayes approach here would assign final 
probability 1 to Mq whenever it has the largest marginal likelihood, and 
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final probability 1 to Mp whenever it has the largest marginal likelihood. 
These are clearly very unsatisfactory answers. 

The above lemma does highlight a specific, undesirable property of the 
empirical-Bayes approach to variable selection — one whose practical signifi- 
cance we investigate by simulation in Section 6. For the most part, however, 
the rest of our results are of a fundamentally different character. We will not 
be evaluating either the fully Bayes or the empirical-Bayes approach accord- 
ing to an objective yardstick, such as how well each one does at recovering 
true relationships or suppressing false ones. Instead, we focus on compar- 
ing the two approaches to each other in a more formal way. As mentioned 
above, our fundamental goal is to understand when, and how, empirical 
Bayes corresponds asymptotically to full Bayes analysis. Such a compari- 
son is certainly of interest, both to Bayesians who might consider empirical 
Bayes as a computational approximation, and to frequentists for the reasons 
mentioned in the Introduction. 

To explore the difference between these two approaches, it is useful to 
abstract the problem somewhat and suppose simply that the data Y have 
sampling density f(Y \ 6), and let 9 E O have prior density ir(0 | A) for 
some unknown hyperparameter A € A. Empirical-Bayes methodology typi- 
cally proceeds by estimating A from the data using a consistent estimator. 
[The Type-II MLE approach would estimate A by the maximizer of the 
marginal likelihood m(Y | A) = J A /(Y | 6)ir(0 \ \)d9, and this will typi- 
cally be consistent in empirical-Bayes settings.] It is then argued that (at 
least asymptotically) the Bayesian analysis with A will be equivalent to the 
Bayesian analysis if one knew A. (This claim is most interesting when the 

prior for A is unknown; if it is known, then there are also strong frequentist 
reasons to use this prior in lieu of empirical Bayes.) 

To contrast this with a full Bayesian analysis, suppose we have a prior 
density 7r(A) for A and a target function ip(6, Y | A). For instance, ip could 
be the posterior mean of given A and Y, or it could be the conditional 
posterior distribution of 6 given A and Y. The empirical-Bayesian claim, in 
this context, would be that 



that is, that the full Bayesian answer on the left can be well approximated by 
the empirical-Bayes answer on the right. The justification for (12) would be 
based on the fact that, typically, 7r(A | Y) will collapse to a point mass 
near the true A as the sample size increases, so that (12) will hold for 
appropriately smooth functions ip{6,~Y | A) when the sample size is large. 

There are typically better approximations to the left-hand side of (12), 
such as the Laplace approximation. These, however, are focused on repro- 
ducing the full-Bayes analysis through an analytic approximation, and are 



(12) 
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not "empirical-Bayes" per se. Likewise, higher-order empirical-Bayes anal- 
ysis will likely yield better results here, but the issue is in realizing when 
one needs to resort to such higher-order analysis in the first place, and in 
understanding why this is so for problems such as variable selection. 

That (12) could fail for nonsmooth tp(9, Y | A) is no surprise. But what 
may come as a surprise is that this failure can also occur for very common 
functions. 

Most notably, it fails for the conditional posterior density itself. Indeed, 
in choosing ip(0,Y | A) = ir(0 | A, Y), the left-hand side of (12) is just the 
posterior density of 6 given Y, which (by definition) can be written as 

(13) 7Tp(0 | Y) oc /(Y | 6) [ 7r(0 I A)7r(A)dA. 

J A 

On the other hand, for this choice of ip, (12) becomes 

(14) 7r E (0 I Y) « ir(0 | Y, A) oc f(Y | 6) ■ tt(0 | A), 

and the two expressions on the right-hand sides of (13) and (14) can be 
very different. [This difference may not matter, of course; for instance, if 
/(Y | 6) is extremely concentrated as a likelihood, the prior being used may 
not matter.] 

As an indication as to what goes wrong in (12) for this choice of tp, note 
that 

7Tf(0|Y)= / tt(0| A,Y)-vr(A| Y)d\ 
J A 

(15) =f <d [ X [V-^\Y)dX 



J A tt(A|Y) 

(16) =Ja /(ymaiy) -^^^ 

Of course, these elementary calculations simply lead to (13) after further 
algebra. But they illuminate the fact that, while 7r(A | Y) may indeed be 
collapsing to a point mass at the true A, this term occurs in both the nu- 
merator and the denominator of the integrand and therefore cancels. The 
accuracy with which a point mass at A approximates 7r(A | Y) is thus essen- 
tially irrelevant from the standpoint of full Bayes analysis. 

4.2. Comparison using Kullback-Leibler convergence. Our goal, then, is 
to understand when (13) and (14) will yield the same answers, in an asymp- 
totic sense. The closeness of these two distributions will be measured by 
Kullback-Leibler divergence, a standard measure for comparing a pair of 
distributions P and Q over parameter space O: 
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Kullback-Leibler divergence can be used to formalize the notion of empirical- 
Bayes convergence to fully Bayesian analysis as follows: 

KL empirical- B ayes convergence. Suppose the data Y and parameter 9 
have joint distribution p(Y, 9 | A), where 9 £ is of dimension m, and where 
A € A is of fixed dimension that does not grow with m. Let tte = tt{iP{9) \ 
Y, A) be the empirical-Bayes posterior distribution for some function of 
the parameter ip(0), and let tt f = tt(V>(0) | Y) = J a tt(i/j(9) \ Y, A) • n(X)dX 
be the corresponding fully Bayesian posterior under the prior vr(A). If, for 
every A G A, KL(7rp || tte) — > in probability [expectation] under p(Y,9 \ 
A) as in — y oo, then tt f will be said to be KL-convergent in probability 
[expectation] to the fully Bayesian posterior tt f . 

Note that KL convergence is defined with respect to a particular function 
of the parameter, along with a particular prior distribution on the hyperpa- 
rameter. The intuition is the following. Suppose that in trying to estimate 
a given function tp(9), it is possible to construct a reasonable prior 7r(A) 
such that the KL-convergence criterion is met. Then the empirical Bayes 
and full Bayes analysis will disagree for every finite sample size, but are at 
least tending toward agreement asymptotically. If, on the other hand, it is 
not possible to find a reasonable prior vr(A) that leads to KL convergence, 
then estimating ip(9) by empirical Bayes is dubious from the fully Bayesian 
perspective. A Bayesian could not replicate such a procedure even asymp- 
totically, while a frequentist may be concerned by complete-class theorems. 
(A "reasonable" prior is a necessarily vague notion, but obviously excludes 
things such as placing a point mass at A.) 

Instead of KL divergence, of course, one might instead use another dis- 
tance or divergence measure. The squared Hellinger distance is one such 
possibility: 

K 2 (P\\Q) = \ J^VW)-VW)?d9. 

Most of the subsequent results, however, use KL divergence because of its 
familiarity and analytical tractability. 

4.3. An orthogonal example. As a simple illustration of the above ideas, 
consider the following two examples of empirical-Bayes analysis. The first 
example satisfies the convergence criterion; the second does not. Both exam- 
ples concern the same sampling model, in which we observe a series of con- 
ditionally independent random variables yi ~ N(#j, 1), and where we know 
that 6i ~ N(/x, 1). Thus, the hyperparameter A = fi here. Let 9 = (6i, . . . , 6 m ) 
and y = (yi,...,y m ). 

Alternatively, this can be thought of as an orthogonal regression problem 
where both the dimension and number of samples are growing at the same 
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rate: y = X6 + e, with X being the m x m identity matrix. This framing 
makes the connection to variable selection much more plain. 

The natural empirical-Bayes estimate of fi is the sample mean £ie = y, 
which is clearly consistent for \i as m — > oo and converges at the usual 1 / y/m 
rate. A standard hyperprior in a fully Bayesian analysis, on the other hand, 
would be \x ~ N(0, A) for some specified A; the objective hyperprior n(fi) = 1 
is essentially the limit of this as A — > oo . Using the expressions given in, for 
example, Berger (1985), the empirical-Bayes and full Bayes posteriors are 

(18) J r E (S|y,/i E ) = N(i(y + 51),il), 

(19) M« I y) = N(i(y + si) - (^>L 51 + ^ty^'))- 

where I is the identity matrix and 1 is a column vector of all ones. 

Example 1. Suppose only the first normal mean, 0\, is of interest, 
meaning that the target function ip{0) = 9\. Then sending A — > oo yields 

(20) 7T E (6 1 | y, pi E ) = N([ yi + y}/2, 1/2), 

(21) 7r F (#i | y) = N([yi + y}/2, 1/2 + [2m]" 1 ). 

It is easy to check that KL(ttf \\ we) — > as m — > oo. Hence, tte(0i) arises 
from a KL-convergent EB procedure under a reasonable prior, since it cor- 
responds asymptotically to the posterior given by the objective prior on the 
hyperparameter /i. 

Example 2. Suppose now that 0, the entire vector of means, is of inter- 
est [hence, ip(6) = 0]. The relevant distributions are then the full tie and ttf 
given in (18) and (19), with parameters (0e,^e) and (0f,£f), respectively. 

A straightforward computation shows that KL(-7Tf || tte) is given by 



KL — - 

2 

(22) 

1 



( +tr(£ E 1 £ F ) + (0£-0 F ) t ££ 1 (0 £ - 

2 



/ mA \ mA ( 1 

log 1 H H h 2m — y l 

S V mA + 2) mA + 2 \mA + 2J y 

For any nonzero choice of A and for any finite value of the hyperparameter 
/i, it is clear that under p(y, \ fi) the quantity [2m/ (mA + 2) 2 ] • y 2 — > in 
probability as m — > oo. Hence, for any value of A (including A = oo), the 
KL divergence in (22) converges to (1 — log2)/2 > 0. 

Of course, this only considers priors of the form /i~N(0,A), but the 
asymptotic normality of the posterior for \x can be used to prove the result for 
essentially any prior that satisfies the usual regularity conditions, suggesting 
that there is no reasonable prior for which tte(0) is KL-convergent. 
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The crucial difference here is that, in the second example, the parameter of 
interest increases in dimension as information about the hyperparameter jj, 
accumulates. This is not the usual situation in asymptotic analysis. Hence, 
even as Op and Op are getting closer to each other elementwise, the KL 
divergence does not shrink to as expected. 

Two further comments are in order. First, a similar argument shows that 
the fully Bayes posterior is not KL-convergent to the so-called "oracle poste- 
rior" 7r(0 I y, ht) — that is, the conditional posterior distribution for 6, given 
the true value of \i. This is not a source of worry for Bayesians, but it makes 
clear that the disagreement between EB and FB procedures cuts both ways, 
and is not merely a "failure" of empirical-Bayes; if a non-Bayesian's goal is to 
reconstruct the oracle posterior, this could be achieved by empirical-Bayes 
analysis but not by full Bayes. 

Second, the situation described above has the sample size n equal to 
the number of unknown parameters m. If n grows relative to m, the full 
Bayes and empirical-Bayes/oracle posteriors can indeed be KL-convergent. 
For instance, suppose there are r independent replicate observations for each 
Hi. Then a similar calculation shows that KL(7rp || tte) = 0(l/r) as r — > oo, 
so that KL convergence between the two approaches would obtain. 

5. Results for variable selection. For the variable-selection problem, ex- 
plicit expressions for the KL divergence between empirical-Bayes and fully 
Bayes procedures are not available. It is also quite difficult to characterize 
the sampling distribution of p, the empirical-Bayes estimate for the prior in- 
clusion probability p. It is therefore not yet possible to give a general char- 
acterization of whether, and when, the empirical-Bayes variable-selection 
procedure is KL-convergent, in the sense defined above, to a fully Bayesian 
procedure. 

Three interesting sets of results are available, however. First and most 
simply, we can characterize the KL divergence between the prior probability 
distributions of the fully Bayesian and empirical-Bayesian procedures. Sec- 
ond, we can characterize the limiting expected Kullback-Leibler divergence 
between EB and FB posteriors, even if we cannot characterize the limiting 
KL divergence itself. Third, we can compare the asymptotic behavior of the 
full Bayes and empirical-Bayes prior model probabilities for models in a size 
neighborhood of the true model. 

We denote the empirical-Bayes prior distribution over model indicators 
by pe(M~[) and the fully-Bayesian distribution (with uniform prior on p) 
by pf{M 1 ). Similarly, after observing data D, we write pe(M 7 | Y) and 
Pf(M^ I Y) for the posterior distributions. 

5.1. Prior KL divergence. The first two theorems prove the existence of 
lower bounds on how close the EB and FB priors can be, and show that 
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these lower bounds become arbitrarily large as the number of tests m goes 
to infinity. We refer to these lower bounds as "information gaps," and give 
them in both Kullback-Leibler (Theorem 5.1) and Hellinger (Theorem 5.2) 
flavors. 

Theorem 5.1. Let G(m) = min^KL(pj?(M 7 ) ||p B (M 7 )). ThenG(m)^ 
oo as m — > oo . 



Proof. The KL divergence is 



KL =E 



i 



m + 1 



log 



k=0 

(23) =-log(m + l) 



1 



m + 1 \ k 



71) 



io g (p fe -(i-pr- fc ) 



m—k^ 



1 m 

+i ^ 

fc=0 



log ( ^ ) + k logp + (m — k) log(l — p) 



This is minimized for p=l/2 regardless of m, meaning that 



(24) 



G(m) = -log(m + l)-— ^—rjt log (7) 



log( k ) +mlog(l/2) 



mlog2 - log(m + 1) TtX^ 1o s(^) • 

171 k=o ^ ' 



The first (linear) term in (24) dominates the second (logarithmic) term, 
whereas results in Gould (1964) show the third term to be asymptotically 
linear in m with slope 1/2. Hence, G(m) grows linearly with m, with asymp- 
totic positive slope of log 2 — 1/2. □ 



Theorem 5.2. LetR 2 (m) =mm f) ~R 2 {p F (M 1 ) ||p B (M 7 )). ThenR 2 ( 



m) 



1 as m — )■ oo . 
Proof. 

(25) H 2 (p F (M~)\\p E (MJ)) = l 



j5 fc (l — j3) 



m—k 



This distance is also minimized for p = 1/2, meaning that 
(26) 



H 2 (m) = l- (m + 1)- 1 / 2 - 2-/2.^ JW 

fc=o V \ ' 
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A straightforward application of Stirling's approximation to the factorial 
function shows that 



(27) lim 



m— >oo 



.+ 1)- 1/2 -2-»' 2 -E\/(T) 

fc=0 V \ / 



0. 



from which the result follows immediately. □ 



In summary, the ex-post prior distribution associated with the EB pro- 
cedure is particularly troubling when the number of tests m grows without 
bound. On the one hand, when the true value of k remains fixed or grows at 
a rate slower than m — that is, when concerns over false positives become the 
most trenchant, and the case for a Bayesian procedure exhibiting strong mul- 
tiplicity control becomes the most convincing — then p — > and the EB prior 
PE(M-y) becomes arbitrarily bad as an approximation to pf(M 7 ). (Here, the 
correction under the empirical-Bayes approach will be more aggressive com- 
pared with the Bayesian approach, and some may consider this additional 
aggressiveness to be a source of strength.) On the other hand, if the true 
k is growing at the same rate as m, then the best one can hope for is that 
p = 1/2. And even then, the information gap between pp{M-y) and pe(M~) 
grows linearly without bound (for KL divergence), or converges to 1 (for 
Hellinger distance). 

5.2. Posterior KL divergence. We now prove a theorem showing that, 
under very mild conditions, the expected KL divergence between FB and EB 
posteriors for the variable-selection problem is infinite. This version assumes 
that the error precision (ft is fixed, but the generalization to an unknown (ft 
is straightforward. 

Theorem 5.3. In the variable- selection problem, let m, n > m, and 
(ft > be fixed. Suppose X 7 is of full rank for all models and that the family of 
priors for model- specific parameters, {tt(/3_,)}, is such that p((3^ = 0) < 1 for 
all M 7 . Then, for any true model M'E , the expected posterior KL divergence 
E[KL{p^(M 7 | Y) || pe{M^ I Y)}] under this true model is infinite. 

Proof. The posterior KL divergence is 

(28) KL(p F (M 7 I Y) \\p E (M^ | Y)) = £>(M 7 I Y)-log( ^|^j^ . 

This is clearly infinite if there exists a model M 7 for which pe(M~ | Y) = 
but pf(M^ I Y) > 0. Since the fully Bayesian posterior assigns nonzero prob- 
ability to all models, this condition is met whenever the empirical-Bayesian 
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solution is p = or p = 1. Thus, it suffices to show that p will be with 
positive probability under any true model. 

Assume without loss of generality that = 1. Recall that we are also 
assuming that 7t(q) = 1 for all models, and that the intercept is orthogonal 
to all other covariates. Letting /3 7 * = (a,/3 7 ) 4 for model My, and letting 
L(-) stand for the likelihood, the marginal likelihood for any model can then 
be written 

(29) f(Y | M y ) = 109*) • / 5 (A>09 7 ) d/3 
where 

ff (/3 T ) = exp{-i(/3 7 - ( 3 7 )'X 7 X 7 (/3 7 - 7 )}. 
The Bayes factor for comparing the null model to any model is 

_ /(Y | Mq) 
B ^ Y) -f(Y\MyY 

which from (29) is clearly continuous as a function of Y for every 7. Eval- 
uated at Y = (so that /3 7 then equals 0), this Bayes factor satisfies 

(30) By(0)= ^exp|-i/3 7 X 7 X 7 /3 7 |^(/3 7 )(i/3 7 ^ >1 

for each My under the assumptions of the theorem. 

By continuity, for every model My there exists an e 7 such that 2? 7 (Y) > 1 
for any |Y| < e~. Let e* = min 7 e 7 . Then for Y satisyfing |Y| < e* , J3 7 (Y) > 
1 for all nonnull models, meaning that Mq will have the largest marginal 
likelihood. By Lemma 4.1, p = when such a Y is observed. 

But under any model, there is positive probability of observing |Y| < e* 
for any positive e* , since this set has positive Lebesgue measure. Hence, 
regardless of the true model, there is positive probability that the KL diver- 
gence KL(pp(My I Y) \\pE(My I Y)) is infinite under the sampling distribu- 
tion p(Y I My), and so its expectation is clearly infinite. □ 

Since the expected KL divergence is infinite for any number m of variables 
being tested, and for any true model, it is clear that E(KL) does not converge 
to as m — > 00 . This, of course, is a weaker conclusion than would be a lack 
of KL convergence in probability. 

In Theorem 5.3 the expectation is with respect to the sampling distri- 
bution under a specific model My, with /3 either fixed or marginalized 
away with respect to a prior distribution. But this result implies an infi- 
nite expectation with respect to other reasonable choices of the expectation 
distribution — for example, under the Bernoulli sampling model for 7 in (5) 
with fixed prior inclusion probability p. 
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5.3. Asymptotic behavior of prior model probabilities under EB and FB 
procedures. While interesting, the results in the previous two sections do 
not consider the usual type of asymptotic comparison, namely, how do the 
full Bayes and empirical Bayes posterior distributions converge as m — > oo? 
It is not clear that such asymptotic comparisons are possible in general, 
although very interesting results can be obtained in particular contexts [cf. 
Bogdan, Chakrabarti and Ghosh (2008), Bogdan, Ghosh and Tokdar (2008)]. 

A rather general insight related to such comparison can be obtained, 
however, by focusing on the prior probabilities of "high posterior" models, 
as m — > oo. To do so, we first need an approximation to the full Bayes prior 
probability of M 7 , given in the following lemma. The proof is straightforward 
Laplace approximation, and is omitted. 

Lemma 5.4. As m—>oo, consider models of size fc 7 such that kj/rn is 
bounded away from and 1. Then the Bayesian prior probability of M 7 with 
prior ir(p) is 



providing 7r(-) is continuous and nonzero. 

Now suppose pt is the true prior variable inclusion probability and con- 
sider the most favorable situation for empirical Bayes analysis, in which the 
empirical Bayes estimate for px satisfies 



It is not known in general when this holds, but it does hold in exchangeable 
contexts where each variable is in or out of the model with unknown prob- 
ability p, since such problems are equivalent to mixture model problems. 

For models far from the true model, the prior model probabilities given by 
the Bayesian and empirical Bayesian approaches can be extremely different. 
Hence, it is most interesting to focus on models that are close to the true 
model for the comparison. In particular, we restrict attention to models 
whose size differs from the true model by 0(y/m). 

Theorem 5.5. Suppose the true model size kx satisfies kT/m = pT + 
0(l/y/m) as m— > oo, where < px < 1. Consider all models M 7 such that 





x{l + o(l)} 



(31) 
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kx — k~f = 0(y/rn), and consider the optimal situation for EB in which (31) 
holds. Then the ratio of the prior probabilities assigned to such models by 
the Bayes approach and the empirical Bayes approach satisfies 

Pe{M^) \m J \ m J \ m J \ m J \m J 

X m~ 1/2 {l + o(l)} 
x ((p)^(l-p) m -^) _1 

-°(-r' 

providing ir(-) is continuous and nonzero. 

Proof. Note that 
(32) p E (M^) = {pp(l-p) m - k - = { PT (l +e E )} k -<{l - PT (1 + e E )} m - k -. 
Taking the log and performing a Taylor expansion yields 
logp B (M 7 ) = log{^7(l -ptT 1 -^} + fey log (1 + e E ) 

+ (m- fc 7 )log|l - g g 

= log{pj7(l - Pr ) m - fc -} + A; 7 { ££ ; + 0(4)} 
+ (m - /c 7 )<| - n — — + 0(e|; 



(1-Pt) 

= log{p*7(l -pT) m "^} + {^T + 0(^)}{^ + 0(4)} 

+ { m _ fc r - (Vm)}\- n PT r e E + 0(4)1 

= log{ P7 7(l -ptV-^} + 0(v^e B ) + 0(m4) 
= log{p k T ->(l-p T r- k -} + 0(l). 

A nearly identical argument using Lemma 5.4 shows that the log Bayesian 
prior probability for these models is 

(33) log{p F (M 7 )} =log{pj7(l -PT) m ~^} -lo gV ^ + 0(l), 
from which the result is immediate. □ 



So we see that, even under the most favorable situation for the empirical- 
Bayes analysis, and even when only considering models that are close to the 
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true model in terms of model size, the prior probabilities assigned by the 
Bayes approach are smaller by a factor of order 1 / \fm than those assigned 
by the empirical-Bayes approach. The effect of this very significant difference 
in prior probabilities will be context dependent, but the result does provide 
a clear warning that the full Bayes and empirical-Bayes answers can differ — 
even when m — > oo and even when there is sufficient information in the data 
to guarantee the existence of a consistent estimator for pr- 

The theorem also shows that the empirical-Bayes procedure provides a 
better asymptotic approximation to the "oracle" prior probabilities, which 
may be argued by some to be the main goal of empirical-Bayes analysis. At 
least for this ideal scenario, the EB approach assigns larger prior probabili- 
ties to models which are closer to the true model. Of course, this fact is not 
especially relevant from the fully Bayesian perspective, and does not neces- 
sarily counterbalance the problems associated with ignoring uncertainty in 
the estimator for p. 

Finally, this difference in prior probabilities will not always have a large 
effect. For instance, if n— > oo at a fast enough rate compared with m, then 
the Bayes and empirical-Bayes approach will typically agree simply because 
all of the posterior mass will concentrate on a single model [i.e., one of the 
marginal likelihoods f(Y | M 7 ) will become dominant], and so the assigned 
prior probabilities will be irrelevant. 

6. Numerical investigation of empirical-Bayes variable selection. This 
section presents numerical results that demonstrate practical, finite-sample 
significance of some of the qualitative differences mentioned above. As in 
the previous section, most of the investigation is phrased as a comparison of 
empirical-Bayes and full Bayes, taken from the fully Bayesian perspective. 

Note that m here is taken to be moderate (14 for the simulation study 
and 22 for the real data set); the intent is to focus on the magnitude of 
the difference that one can expect in variable selection problems of such 
typical magnitude. Of course, such m are not large enough that one would 
automatically expect the empirical-Bayes approach to provide an accurate 
estimate of p, and so differences are to be expected, but it is still useful to 
see the magnitude of the differences. For a larger m situation, see Table 1; 
for the largest m in that table, the full Bayes and empirical-Bayes answers 
are much closer. The rationale for taking these values of m is that they allow 
the model space to be enumerated, avoiding potential confounding effects 
due to computational difficulties. 

6.1. Results under properly specified priors. The following simulation 
was performed 75,000 times for each of four different sample sizes: 

1. Draw a random mxn design matrix X of independent N(0, 1) covariates. 
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Fig. 3. Distribution of p in the simulation study (n = 60) with a correctly specified (uni- 
form) prior for p. The gray bars indicated the number of times, among values of p in the 
extremal bins, that the empirical-Bayes solution collapsed to the degenerate p — Q orp=\. 



2. Draw a random p~ U(0,1), and draw a sequence of m independent 
Bernoulli trials with success probability p to yield a binary vector 7 
encoding the true set of regressors. 

3. Draw (3^, the vector of regression coefficients corresponding to the nonzero 
elements of 7, from a Zellner-Siow prior. Set the other coefficients /3_ 7 
to 0. 

4. Draw a random vector of responses Y ~ N(X/3,I). 

5. Using only X and Y, compute marginal likelihoods (assuming Zellner- 
Siow priors) for all 2 m possible models; use these quantities to compute 
p along with the EB and FB posterior distributions across model space. 

In all cases m was fixed at 14, yielding a model space of size 16,384 — large 
enough to be interesting, yet small enough to be enumerated 75,000 times in 
a row. We repeated the experiment for four different sample sizes (n = 16, 
n = 30, n = 60 and n = 120) to simulate a variety of different m/n ratios. 

Two broad patterns emerged from these experiments. 

First, as Figure 3 shows, the EB procedure gives the degenerate p = 
or p = 1 solution much too often. When n = 60, for example, almost 15% 
of cases collapsed to p = or p = 1 . This is essentially the same fraction of 
degenerate cases as when n = 16, which was 16%. This suggests that the 
issues raised by Theorem 5.3 can be quite serious in practice, even when n 
is large compared to m. 

Second, even in nondegenerate situations, the two procedures often reached 
very different conclusions about which covariates were important. Figure 4 
shows frequent large discrepancies between the posterior inclusion proba- 
bilities given by the EB and FB procedures. This happened even when n 
was relatively large compared to the number of parameters being tested, 
suggesting that even large sample sizes do not render a data set immune to 
this difference. (Note that Figure 4 only depicts the differences that arise 
when the empirical-Bayes solution does not collapse to either or 1.) 
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Fig. 4. Differences in all m inclusion probabilities between EB and FB analyses across 
all nondegenerate cases (i.e., where the EB solution does not collapse to the boundary). 
The percentage of points lying outside the boxplot whiskers (1.5 times the inter- quartile 
range) are as follows: 14% for n — 16, 12% for n = 30, 8% for n = 60 and 7% for n = 120. 



6.2. Results under improperly specified priors. The previous section 
demonstrated that significant differences can exist between fully Bayesian 
and empirical-Bayes variable selection in finite-sample settings. There was 
an obvious bias, however, in that the fully Bayesian procedure was being 
evaluated under its true prior distribution, with respect to which it is nec- 
essarily optimal. 

It is thus of interest to do a similar comparison for situations in which 
the prior distribution is specified incorrectly: the fully Bayesian answers will 
assume a uniform prior p, but p will actually be drawn from a nonuniform 
distribution. We limit ourselves to discussion of the analogue of Figure 3 for 
various situations, all with m = 14 and n = 60. Three different choices of the 
true distribution for p were investigated, again with 75,000 simulated data 
sets each: 

1. p ~ Be(3/2, 3/2), yielding mainly moderate (but not uniform) values of 
p. 

2. p ~ Be(l, 2), yielding mainly smaller values of p. 

3. p ~ 0.5 • Be(l/2, 8) + 0.5 • Be(8, 1/2), yielding primarily values of p close 
to or 1. 

The results are summarized in Figure 5. In each case the central pane 
shows the true distribution of p, with the left pane showing the Bayesian 
posterior means under the uniform prior and the right pane showing the 
empirical-Bayes estimates p. 

As expected, the incorrectly specified Bayesian model tends to shrink 
the estimated values of p back to the prior mean of 0.5. This tendency is 
especially noticeable in Case 3, where the true distribution contains many 
extreme values of p. This gives the illusion that empirical-Bayes tends to do 
better here. 
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Fig. 5. Distribution ofp (n = 6Q) in different versions of the simulation study, where the 
fully Bayesian model had a misspecified (uniform) prior on p. The gray bars indicated the 
number of times, among values ofp in the extremal bins, that the empirical-Bayes solution 
collapsed to the degenerate p = or p = 1. 

Notice, however, the gray bars in the right-most panes. These bars indicate 
the percentage of time, among values of p that fall in the left- or right-most 
bins of the histogram, that the empirical-Bayes solution is exactly or 1, 
respectively. For example, of the roughly 20,000 times that p£ [0,0.1) in 
Case 2, it was identically more than 10,000 of those times. (The fully 
Bayesian posterior mean, of course, is never exactly or 1.) 
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The bottom panel of Figure 5 shows that, paradoxically, where the fully 
Bayesian model is most incorrect, its advantages over the empirical-Bayes 
procedure are the strongest. In the mixture model giving many values of p 
very close to or 1, empirical Bayes collapses to a degenerate solution nearly 
half the time. Even if the extremal model is true in most of these cases, 
recall that the empirical-Bayes procedure would result in an inappropriate 
statement of certainty in the model. Of course, this would presumably be 
noticed and some correction would be entertained, but the frequency of 
having to make the correction is itself worrisome. 

In these cases, while the fully Bayesian posterior mean is necessarily 
shrunk back to the prior mean, this shrinkage is not very severe, and the 
uniform prior giving rise to such shrinkage can easily be modified if it is be- 
lieved to be wrong. And in cases where the uniform prior is used incorrectly, 
a slight amount of unwanted shrinkage seems a small price to pay for the 
preservation of real prior uncertainty. 

6.3. Results when p is fixed. We conducted a final version of the simula- 
tion with p fixed at 3 different values: p = 0.10, p = 0.25, and p = 0.5. Figure 
6 plots the estimated values of p under the fully Bayes and empirical-Bayes 
procedures. (For the sake of visual clarity only the results from 2000 data 
sets are shown.) 

It is clear that for the smallest value of p = 0.1, the degenerate solution 
p = occurs quite frequently. When p is moderate (as in the 0.25 or 0.5 
cases), degeneracy occurs much less often. 

It is also interesting to see the differences in how well the EB and FB anal- 
ysis approximate the "oracle" inclusion probabilities, which are the posterior 
inclusion probabilities one would compute if one knew the true Bernoulli 
probability p. This can be measured by looking at the t\ distance from the 
oracle estimate: 

rn 

i=i 

where p° r is the oracle posterior inclusion probability for the jth variable. 

The two procedures do quite similarly here, but with subtle differences. 
For example, on the "sparse" (p = 0.1) case, the mean l\ distance to the 
oracle answer across all Monte Carlo draws was 0.36 for the EB posterior, 
and 0.40 for the FB posterior. Yet the median i\ distance to the oracle 
answer was 0.27 for the FB posterior, and 0.30 for the EB posterior. 

These differences were largely consistent across other values of p. This 
suggests that, while the FB procedure seems to reconstruct the oracle pos- 
terior inclusion probabilities better for a larger number of data sets (such 
as when the empirical-Bayes answer is degenerate), it tends to miss by a 
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Fig. 6. Distribution of p (n = 6Q) in the fixed-p versions of the simulation study (2000 
subsamples of the fake data sets). The dashed line indicates the true value of p. 
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larger amount than the EB procedure does. This results in a worse level 
of average performance for the FB procedure in reconstructing the oracle 
posterior inclusion probabilities. 

6.4. Example: Determinants of economic growth. The following data set 
serves to illustrate the differences between EB and FB answers in a scenario 
of typical size, complexity and m/n ratio. 

Many econometricians have applied Bayesian methods to the problem 
of GDP-growth regressions, where long-term economic growth is explained 
in terms of various political, social and geographical predictors. Fernandez, 
Ley and Steel (2001) popularized the use of Bayesian model averaging in the 
field; Sala-i Martin, Doppelhofer and Miller (2004) used a Bayes-like proce- 
dure called BACE, similar to BIC-weighted OLS estimates, for selecting a 
model; and Ley and Steel (2009) considered the effect of prior assumptions 
(particularly the pseudo-objective p = 1/2 prior) on these regressions. 

We study a subset of the data from Sala-i Martin, Doppelhofer and Miller 
(2004) containing 22 covariates on 30 different countries. A data set of this 
size allows the model space to be enumerated and the EB estimate p to 
be calculated explicitly, which would be impossible on the full data set. 
The 22 covariates correspond to the top 10 covariates flagged in the BACE 
study, along with 12 others chosen uniformly at random from the remaining 
candidates. 

Summaries of exact EB and FB analyses (with Zellner-Siow priors) can 
be found in Table 3. Two results are worth noting. First, the EB inclusion 
probabilities are nontrivially different from their FB counterparts, often dis- 
agreeing by 10% or more. 

Second, if these are used for model selection, quite different results would 
emerge. For instance, if median-probability models were selected (i.e., one 
includes only those variables with inclusion probability greater than 1/2), 
the FB analysis would include the first four variables (and would almost 
choose the fifth variable), while the EB analysis would select only the first 
two variables (and almost the third). While we would not endorse simply 
choosing a model here, note that doing so would result in fundamentally 
different economic pictures for the FB and EB analysis. 

7. Summary. This paper started out as an attempt to more fully un- 
derstand when, and how, multiplicity correction automatically occurs in 
Bayesian analysis, and to examine the importance of ensuring that such 
multiplicity correction is included. That the correction can only happen 
through the choice of appropriate prior probabilities of models seemed to 
conflict with the intuition that multiplicity correction occurs through data- 
based adaptation of the prior-inclusion probability p. 



BAYES AND EMPIRICAL-BAYES MULTIPLICITY ADJUSTMENT 



29 



The resolution to this conflict — that the multiplicity correction is indeed 
pre-fixed in the prior probabilities, but the amount of correction employed 
will depend on the data — led to another conflict: how can the empirical- 
Bayes approach to variable selection be an accurate approximation to the 
full Bayesian analysis? Indeed, we have seen in the paper that empirical- 
Bayes variable selection can lead to results quite different than those from 
the full Bayesian analysis. This difference was evidenced through examples 
(both simple pedagogical examples and a more realistic practical example), 
through simulation studies, and through information-based theoretical re- 
sults. These studies, as well as the results about the tendency of empirical- 
Bayes variable selection to choose extreme p, all supported the general con- 
clusions about empirical-Bayes variable selection that were mentioned in the 
Introduction. 

APPENDIX: VARIATIONS ON ZELLNER'S G-PRIOR 

Conventional variable-selection priors rely upon the conjugate normal- 
gamma family of distributions, which yields closed- form expression for the 



Table 3 

Exact inclusion probabilities for 22 variables in a linear model for GDP growth among 

a group of 30 countries 



Covariate 


Fully Bayes 


Emp. Bayes 


East Asian dummy 


0.983 


0.983 


Fraction of tropical area 


0.727 


0.653 


Life expectancy in 1960 


0.624 


0.499 


Population density coastal in 1960s 


0.518 


0.379 


GDP in 1960 (log) 


0.497 


0.313 


Outward orientation 


0.417 


0.318 


Fraction GDP in mining 


0.389 


0.235 


Land area 


0.317 


0.121 


Higher education 1960 


0.297 


0.148 


Investment price 


0.226 


0.130 


Fraction Confucian 


0.216 


0.145 


Latin American dummy 


0.189 


0.108 


Ethnolinguistic fractionalization 


0.188 


0.117 


Political rights 


0.188 


0.081 


Primary schooling in 1960 


0.167 


0.093 


Hydrocarbon deposits in 1993 


0.165 


0.093 


Fraction spent in war 1960-1990 


0.164 


0.095 


Defense spending share 


0.156 


0.085 


Civil liberties 


0.154 


0.075 


Average inflation 1960-1990 


0.150 


0.064 


Real exchange rate distortions 


0.146 


0.071 


Interior density 


0.139 


0.067 
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marginal likelihoods. To give an appropriate scale for the normal prior de- 
scribing the regression coefficients, Zellner (1986) suggested a particular form 
of this family: 



(/SI^-N^o^X'X)- 1 ). 



Ga(-,- 
1 2' 2 



with prior mean (3 , often chosen to be 0. The conventional choice g = 
n gives a prior covariance matrix for the regression parameters equal to 
the unit Fisher information matrix for the observed data X. This prior 
can be interpreted as encapsulating the information arising from a single 
observation under a hypothetical experiment with the same design as the 
one to be analyzed. 

Zellner's g-prior was originally formulated for testing a precise null hy- 
pothesis, Hq:(3 = (3 , versus the alternative, Ha-P S W. But others have 
adapted Zellner's methodology to the more general problem of testing nested 
regression models by placing a flat prior on the parameters shared by the two 
models and using a g-prior only on the parameters not shared by the smaller 
model. This seems to run afoul of the general injunction against improper 
priors in model selection problems, but can nonetheless be formally justi- 
fied by arguments appealing to othogonality and group invariance; see, for 
example, Berger, Pericchi and Varshavskiy (1998) and Eaton (1989). These 
arguments apply to cases where all covariates have been centered to have a 
mean of zero, which is assumed without loss of generality to be true. 

A full variable-selection problem, of course, involves many nonnested com- 
parisons. Yet Bayes factors can still be formally defined using the "encom- 
passing model" approach of Zellner and Siow (1980), who operationally de- 
fine all marginal likelihoods in terms of Bayes factors with respect to a base 
model Mb- 

Since the set of common parameters which are to receive improper priors 
depends upon the choice of base model, different choices yield a different 
ensemble of Bayes factors and imply different "operational" marginal likeli- 
hoods. And while this choice of Mb is free in principle, there are only two 
such choices which yield a pair of nested models in all comparisons: the null 
model and the full model. 

In the null-based approach, each model is compared to the null model 
consisting only of the intercept a. This parameter, along with the precision 
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4>, is common to all models, leading to a prior specification that has become 
the most familiar version of Zellner's g-prior: 

(a,(f> | 7) oc 1/cf), 

This gives a simple expression for the Bayes factor for evaluating a model 
7 with k regression parameters (excluding the intercept): 

(35) BF(M 7 : M ) = (1 + g)^- k ^)/ 2 [\ + (1 _ R^g]'^- 1 )/ 2 , 

where R 2 , £ (0, 1] is the usual coefficient of determination for model M 7 . 

Adherents of the full-based approach, on the other hand, compare all 
models to the full model, on the grounds that the full model is usually 
much more scientifically reasonable than the null model and provides a more 
sensible yardstick [Casella and Moreno (2002)]. This comparison can be done 
by writing the full model as 

M F :Y = X 7 7 + X_ 7 /3_ 7) 

with the design matrix partitioned in the obvious way. Then a g-prior is 
specified for the parameters in the full model not shared by the smaller 
model, which again has k regression parameters excluding the intercept: 

(a, Ay, 01 7) oc l/<p, 

(/3_ 7 1 4>, 7) ~ n(o, |(x'_ 7 x_ 7 )-^ . 

This does not lead to a coherent "within-model" prior specification for 
the parameters of the full model, since their prior distribution depends upon 
which submodel is considered. Nevertheless, marginal likelihoods can still be 
consistently defined in the manner of (34). Conditional upon g, this yields 
a Bayes factor in favor of the full model of 

(36) BF(M F : M 7 ) = (1 + 5 )( m " m - 1 )/ 2 (l + g W)- {n - k - 1)/2 , 

where W = {l- R 2 F )/{1- R 2 ). 

The existence of these simple expressions has made the use of g-priors 
very popular. Yet g-priors yield display a disturbing type of behavior often 
called the "information paradox." This can be seen in (35): the Bayes factor 
in favor of M 7 goes to the finite constant {l + g) n ~ m ~ l as i? 7 — > 1 (which can 
only happen if M 7 is true and the residual variance goes to 0). For typical 
problems this will be an enormous number, but still quite a bit smaller than 
infinity. Hence, the paradox: the Bayesian procedure under a g-prior places 
an intrinsic limit upon the possible degree of convincingness to be found in 
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the data, a limit which is confirmed neither by intuition nor by the behavior 
of the classical test statistic. 

Liang et al. (2008) detail several versions of information-consistent g- 
like priors. One way is to estimate g by empirical-Bayes methods [George 
and Foster (2000)]. A second, fully Bayesian, approach involves placing a 
prior upon g that satisfies the condition J£°{1 + g) n ~ k ^~ 1 'x(g) dg = oo for 
all k-y < p, which is a generalization of the condition given in Jeffreys (1961) 
(see Chapter 5.2, equations 10 and 14). 

This second approach generalizes the recommendations of Zellner and 
Siow (1980), who compare models by placing a flat prior upon common 
parameters and a g-like Cauchy prior on nonshared parameters: 

(37) (^i^-c^x^r 1 )- 

These have come to be known as Zellner-Siow priors, and their use can 
be shown to resolve the information paradox. Although they do not yield 
closed- form expressions for marginal likelihoods, one can exploit the scale- 
mixture-of-normals representation of the Cauchy distribution to leave one- 
dimensional integrals over standard g-prior marginal likelihoods with respect 
to an inverse-gamma prior, g ~ IG(l/2,2/n). The Zellner-Siow null-based 
Bayes factor under model M 7 then takes the form 

roc 

BF(M 7 :M )= / (l+g) (n -^- 1)/2 [l + (l-i^) 5 ]-(«-i)/2 

(38) J ° 

xg^ 2 exp(-n/(2g))dg. 
A similar formula exists for the full-based version: 

POD 

BF(M F :M 7 )= / {l + g)^ n - m - l) l 2 [l + Wg\- {n - k - 1)/2 

, , Jo 
(39) 

xg- 3 / 2 eM-n/{2g))dg 

with W given above. 

These quantities can be computed by one-dimensional numerical integra- 
tion, but in high-dimensional model searches this will be a bottleneck. Luck- 
ily there exists a closed-form approximation to these integrals first noted in 
Liang et al. (2008). It entails computing the roots of a cubic equation, and 
extensive numerical experiments show the approximation to be quite accu- 
rate. These Bayes factors seem to offer an excellent compromise between 
good theoretical behavior and computational tractability, thereby overcom- 
ing the single biggest hurdle to the widespread practical use of Zellner-Siow 
priors. 
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